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Abstract. In a recent paper we have suggested that the finite temperature density matrix can be computed efficiently 
by a combination of polynomial expansion and iterative inversion techniques. We present here significant improvements 
over this scheme. The original complex-valued formalism is turned into a purely real one. In addition, we use Chebyshev 
polynomials expansion and fast summation techniques. This drastically reduces the scaling of the algorithm with the width 
of the Hamiltonian spectrum, which is now of the order of the cubic root of such parameter. This makes our method very 
competitive for applications to ab-initio simulations, when high energy resolution is required. 
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INTRODUCTION 

Several fields of computational science (nanotechnology, materials science or biochemistry just to name a few) would 
greatly benefit from the possibility of performing simulations of large systems, containing hundreds of thousands 
of atoms. Conventional electronic structure calculations require the diagonalization of matrices whose size is of 
the order of the number of electrons in the system. The ^ (A^^) -scaling cost of this step greatly limits the range of 
systems which can be tackled by ab-initio techniques, despite the fast-paced progress in the computational power 
of modern processors. Based on the theoretical foundations of the nearsightedness principle of electronic matter[l], 
several techniques have been developed in the last years to avoid the diagonalization step, by directly computing the 
density matrix of the system using linear scaling algorithms [2-7]. One of the earliest approaches have been to compute 
the finite temperature density matrix, i.e. the Fermi function of the system's Hamiltonian, H, by decomposing it into 
easier-to-compute functions of the Hamiltonian[8, 9], for instance Chebyshev polynomials or rational functions. In a 
recent paper[10] we have discussed an exact decomposition of the Fermi operator which can be efficiently computed 
by a combination of polynomial expansion and iterative inversion techniques. In this way, we achieved an efficient 
scaling with Ae, the width of the spectrum of the Hamiltonian in units of ksT. This makes the method attractive for 
applications to metals or low-band gap semiconductors at low electronic temperature. In this short paper we discuss a 
number of improvements to this scheme, which further lower the operation count, leading to a scaling °^ v^'Ae, which 
is, to the best of our knowledge, the lowest so far reported in literature. We will follow closely the scheme of our 
previous work[10], obtaining analytical estimates for the operation count of the different steps which compose our 
algorithm, so as to optimize them in order to achieve optimal performance. 

DETAILS OF THE DECOMPOSITION 

Our decomposition scheme is based on an exact expansion of the Fermi operator, which can be elegantly derived using 
the grand-canonical formalism[ll]. Here we only report the final result, namely the fact that the finite-temperature 
density matrix p can be written in terms of a sum of complex- valued matrices M/[12]: 



p = ^ = = - f 1 -ReM,-\ Ml = 1 _^i(2'-i)V2/'^-H/2/'^ 



(1) 



Equation (1) is exact for any value of P, but large values should be chosen so as to allow for simple and economical 
evaluation of the matrix exponential e^^/^^. Here and in the rest of the paper, with no loss of generality, we use a 
shifted and scaled Hamiltonian, i.e. we set the zero of energies at ji, and measure energy in units of ksT. 



The expensive step in applying Eq. (1) is the inversion of the M; matrices. In Ref.[10] we have shown that, for 
large values of P, a I exists such that all of the matrices with I > I are almost optimally conditioned, and can be easily 
inverted by a polynomial expansion. We have also shown how their overall contribution to the Fermi operator can 
be computed at once, at the same cost of computing a single term, making the computational burden of the method 
virtually independent of P. We will exploit this property and take often the large P limit to derive analytical results. 
The remaining, low-/ matrices are more efficiently treated with an iterative Newton inversion scheme[13]. Overall, 
the scaling in terms of matrix-matrix multipUcations count, which is «: Ae for standard Chebyshev polynomials 
expansion[14], reduces to °^ \fKe, and is even lower when fast polynomial summation methods[15, 16] are used. 

A first improvement over the initial formulation of our scheme can be seeked by noting that only the real part of 
M^' enters the expression for the Fermi operator, so that it can be more efficient to write 
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1 + {e^lP _ i) N,- , where N, = 1 + e^/^ - 2e»/2^cos K 
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Only real matrices are involved in Eq. (2), leading to substantial savings in memory requirements and computation 
time. The inversion of the N/s is the computationally demanding part of this approach; in fact, if one performs an 
analysis similar to the one we have carried out in Ref.[10], the condition number of N/ is found to be approximately 
K (N/) « 1 -h Ae2;r~2 {21 - 1 )"^. This is higher than (M/) « 1 +Aen' ' (2/ - 1 ^ but decreases more rapidly with /. 
The low-/ inverse matrices, which are worse conditioned, can be computed by Newton inversion, whose performances 
are only weakly affected by high condition numbers. 

Moreover, we take profit of the form of Eq. (2) to improve the evaluation of the contribution of the high-Z matrices, 
which in our previous work[10] had a computational cost scaling as Ae^/F. Since we also want to set up an expansion 
in Chebyshev polynomials for we rewrite N; it in terms of an auxiUary matrix X whose spectrum Ues between 
-1 andl: 



N; = C^x2-2C(^cos;r 
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Here we have introduced the shifting and scaUng parameters zo = -|- e^-/^^) /2 and ^ = (e^+Z^^ - e^-/^^) /2, 

computed in terms of the extremal values of the Hamiltonian spectrum, e and e+. These parameters can be estimated 
with a Lanczos procedure. Alternatively, since only a rough estimate is required, one can also perform a test calculation 
on a smaller, similar system, or use a matrix norm as an upper bound to the spectral radius of H. 

The inverse Nf can be therefore approximated as a sum of Chebyshev polynomials of X, N,"' « lULoCiTi (X), 
where 7^ is the i-th Chebyshev polynomial, and the coefficients c, can be computed, by straightforward if tedious 
algebra, and take the values 



Ci = 



Im 



(^b-Vb^sgnRcby , where /; = (e''^^ -zo) /C 



(4) 



An upper bound to the error due to truncation of the Chebyshev expansion can be estimated from Y^^m+i '^^^ 
estimate leads to a complex expression, which simplifies considerably if we assume — e_ = e+ = Ae. In this case, by 
taking the large P limit, that the number of terms to be included in order do reach 10^^ relative accuracy on N7 ' reads 



1 AeDlnlO 
2"*" 7r(2Z- 1)' 



(5) 



this can be used as a first guess, to be refined by explicitly summing up the |c, | as computed from Eq. (4). 

At variance with the polynomial expansion of Ref.[10], the operation count is linear with Ae. This is the same 
scaling observed when the Fermi operator is directly expanded in Chebyshev polynomials [14]. This analogy is not 
surprising, since the matrix powers Tt (X) entering the expansion are all independent of /, and one could in principle 
obtain the whole density matrix from a single Chebyshev polynomial evaluation, computing the expansion coefficients 
by summing over all the c, s. 

However, it is more convenient to stop the summation at / > 1 , so as to reduce mc, and tackle the remaining terms by 
iterative inversion. The practical recipe is therefore to compute the coefficients c, (/) and di = Y^I-jCi (l), using them 
to obtain Nj' ' and the contribution to the Fermi operator from the "tail" of matrices with / > /~, respectively. The next 
step is to compute the terms up to / = 1 by iterative inversion, a task which becomes much more efficient when a good 




FIGURE 1. Estimated number of required matrix-matrix multiplications needed in order to obtain 10 relative accuracy on 
the finite temperature density matrix, as a function of the Hamiltonian spectum range Ae. The grid lines correspond to a scaling 
oc -yAe. We also plot the operation count expected when applying the fast summation methods of Ref . [ 1 6] to the standard Cheby shev 

polynomial expansion[14] (m,o, = 2^ l)Ae). Data points correspond to a test performed on the ab-initio Hamiltonian for 
a LiAl alloy. Details can be found in Ref.[10]: Ae for the system is 3.00 a.u., and we have performed density matrix calculations 
at different temperatures, setting the target accuracy to 10^^. Next to each point, we also report the relative error on the band 
structure energy, as computed with our algorithm, taking as reference the value obtained by diagonalization at the same electronic 
temperature. 



initial guess for the inverse matrix is available. Such a guess can be obtained by using a finite number of terms in the 
extrapolation 
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N, 'o, = — ^— y N, ' 1 +£'"''^ (1 - 1/t?) , witli Tj^cos hsin tan— ^ -. (6) 
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The simplest approximation is already sufficient to guarantee convergence, which is fast and almost indepen- 

dent of the condition number of the N/'s. The number of multiplications needed to obtain a relative accuracy of 10^^ 
on ' starting from N^'j / Tj is in fact, in the large P limit. 



rriN = In — , where x « ^ . (7) 
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One can use N^^V'?' which has been computed with the Cheby shev expansion of the tail contribution, as the initial 
guess for the Newton iteration which gives N^^^^. The converged N^'j is in turn used to evaluate N^^^, and so on. 

The total number of matrix multiplications involved is easily obtained by combining Eq. (5) and (7), m,o, = 
niQ (Tj +L/=i (0- minimizing this quantity the optimal value of [is readily found. The Chebyshev polynomials 
can be computed with fast summation techniques, in which case the operation count for that part becomes m'^j = 3y^mc- 
In Fig. 1 we report our theoretical estimate for m,o( as a function of Ae. The scaling is °^ -sfKe, and the crossover point 
with fast-summed Chebyshev expansion is around Ae w 10^, which is easily reached if electronic temperatures of a 
few hundred K are necessary. In the same figure we also report the results from some test calculations performed for 
the Hamiltonian of a semimetallic LiAl alloy, together with the measured relative error on the total energy, which is 
always one order of magnitude smaller than the target value enforced when choosing a particular Chebyshev expansion 
length and convergence threshold for the matrix inversion. In order to keep the scheme as simple as possible, we have 
not considered the option of fine-tuning the procedure, which could further reduce the operation count. First, the 
operations count for Newton inversion (7) turns out to be often overestimated, so that hand-tuning f can save several 
multiplications. Higher-order extrapolations are easily obtained (see Appendix B of Ref.[10]), which provide a better 



starting point for iterative inversion. Some preliminary results we obtained performing molecular dynamics with forces 
computed on the fly with this scheme show that the number of multiplies needed for iterative inversion can be halved 
if one uses the inverse saved from the previous timestep as the initial guess. 

CONCLUSIONS 

In this short paper we have introduced significant improvements to an algorithm that tackles the problem of Fermi 
operator expansion by an hybrid approach based on a combination of polynomial expansion and iterative matrix 
inversion, which we have presented in a recent paper. With this improved implementation, only real- valued matrices 
are involved, leading to significant savings with respect to the previous, complex-valued formalism. Even more 
important, the scaling of the operation count with the Hamiltonian spectrum width is lowered from VAe to v^'Ae, 
which is extremely appealing for broad-spectrum or low electronic temperature problems. Work in the direction of a 
practical, linear-scaUng implementation of these ideas is in progress[17]. Beside the use of an efficient sparse-matrix 
algebra library, subtle issues regarding matrix truncation must be addressed, and several optimizations, such as the 
extrapolation of M^^ from previous timesteps discussed above, can be used to improve the performances of the 
algorithm. 
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